An Improved Hierarchical Lossless Text Compression Algorithm

Authors

  • Chia-Yuan Teng
  • David L. Neuhoff
Abstract

Several improvements to the Bugajski-Russo N-gram algorithm are proposed. When applied to English text, these result in an algorithm with comparable complexity and approximately 10 to 30% less rate than the commonly used COMPRESS algorithm.

I. The N-Gram Algorithm

The N-gram algorithm of Bugajski and Russo [1] is a hierarchical dictionary-type universal lossless source coder for a finite source alphabet. There are several variations, but they share the same basic idea, which is as follows. There is a set of $L$ dictionaries $D_1, D_2, D_3, \ldots, D_L$, where the level-$i$ dictionary $D_i$ contains words of length $2^{i-1}$, $D_1$ contains all possible single characters, and $D_i \subset D_{i-1} \times D_{i-1}$ for $i = 2, \ldots, L$. To encode a source sequence $(x_1, x_2, \ldots)$, one finds the largest $i$ such that $(x_1, x_2, \ldots, x_{2^{i-1}}) \in D_i$ and sends the level identifier $i$ and the index of $(x_1, x_2, \ldots, x_{2^{i-1}})$ in $D_i$. The level identifier $i$ is encoded with a prefix code, and the index is encoded with $m_i = \log |D_i|$ bits, where $|D_i|$ denotes the size of the set $D_i$ and all logarithms in this paper have base 2. Thus an N-gram algorithm is a kind of variable-length to variable-length code. The effect of the hierarchical coding is to parse $(x_1, x_2, \ldots)$ into words whose lengths are powers of two.

There are two strategies for developing the dictionaries: one-pass and two-pass. In a one-pass method, the dictionaries are developed while the source sequence is being encoded, as in the LZ78 algorithm [2]. In a two-pass method, a long source sequence is first scanned to develop the dictionaries, and then the source sequence is encoded as described above. The dictionaries must themselves be encoded and sent along with the level identifiers and dictionary indices. In this paper, we focus only on two-pass methods. The encoding rate, or simply rate, $R$ of a two-pass method is the sum of the index, dictionary, and level identifier rates; i.e. ...
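The greedy hierarchical parse described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`parse`, `decode`, `index_bits`), the toy dictionaries, and the rounding of $\log |D_i|$ up to a whole number of bits are all assumptions made for illustration. Dictionary construction and the prefix code for the level identifier are not shown.

```python
from math import ceil, log2


def parse(text, dictionaries):
    """Greedy hierarchical parse (illustrative sketch only).

    dictionaries[k] is an ordered list of words of length 2**k, so it
    plays the role of the level i = k + 1 dictionary in the abstract.
    Returns a list of (level, index) pairs.
    """
    out = []
    pos = 0
    while pos < len(text):
        # Try the highest level first, then fall back to shorter words.
        for k in range(len(dictionaries) - 1, -1, -1):
            word = text[pos:pos + 2 ** k]
            if len(word) == 2 ** k and word in dictionaries[k]:
                out.append((k + 1, dictionaries[k].index(word)))
                pos += 2 ** k
                break
        else:
            # Level 1 should contain every alphabet symbol, so this only
            # happens if the text uses a character outside the alphabet.
            raise ValueError(f"symbol {text[pos]!r} not in level-1 dictionary")
    return out


def decode(pairs, dictionaries):
    """Invert parse(): look each (level, index) pair up again."""
    return "".join(dictionaries[level - 1][index] for level, index in pairs)


def index_bits(dictionary):
    """Bits per index, assuming log2 |D_i| is rounded up to an integer."""
    return ceil(log2(len(dictionary))) if len(dictionary) > 1 else 0


# Hypothetical toy dictionaries: each level-i word is a concatenation of
# two level-(i-1) words, matching the constraint stated in the abstract.
D = [["a", "b", "c"], ["ab", "ca"], ["abca"]]
pairs = parse("abcabcab", D)
restored = decode(pairs, D)
```

With these toy dictionaries the input is split into the words `abca`, `b`, `ca`, `b`, whose lengths are all powers of two, and `decode` recovers the original string. Lists are used instead of sets so that each word has a stable index; a real coder would likely use a hash map for lookup.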


Similar resources

The Novel Lossless Text Compression Technique Using Ambigram Logic and Huffman Coding

The new era of networking is looking forward to improved and effective methods of channel utilization. There are many texts for which lossless data recovery is essential because of the importance of the information they hold. Therefore, a lossless decomposition algorithm which is independent of the nature and pattern of text is today's top concern. Efficiency of algorithms used today varies grea...


Lossless Compression of Volumetric Medical Images with Improved 3-D SPIHT Algorithm

This paper presents lossless compression of volumetric medical images with an improved 3-D SPIHT algorithm that searches on asymmetric trees. The tree structure links wavelet coefficients produced by three-dimensional reversible integer wavelet transforms. Experiments show that lossless compression with the improved 3-D SPIHT gives an improvement of about 42% on average over two-dimensional te...


Lossless Compression of a Desktop Image For Transmission

We have presented a desktop image compression algorithm for real-time applications such as remote desktop access by desktop image transmission. The desktop image is called a complex image, because one 800 × 600 true color image has a size of approximately 1.54 MB with pictures and text. The algorithm is called group extraction and coding (GEC). Real-time image transmission requires that t...


(0, 1)-Matrix-Vector Products via Compression by Induction of Hierarchical Grammars

We demonstrate a method for reducing the number of arithmetic operations within a (0, 1)-matrix-vector product. We employ an algorithm, SEQUITUR, developed for lossless text compression, which generates a context-free grammar derived from an inherent hierarchy of repeated sequences. In this context, the sequences are composed of bit patterns for a set of adjacent columns. This grammar will repre...


Lossless Compression Algorithm for Hierarchical IC Layout


Coefficient Statistic Based Modified SPIHT Image Compression Algorithm

Among all wavelet transform and zero-tree quantization based image coding algorithms, set partitioning in hierarchical trees (SPIHT) is well known for its simplicity and efficiency. But theoretical analysis and experimental results have shown that there are still some key points that need further improvement. This paper proposes a coefficient Statistic based Modified SPIHT Lossless Image Compression A...




Publication date: 1995